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ABSTRACT 

A general framework based on Gaussian models and a MAP- 
EM algorithm is introduced in this paper for solving ma- 
trix/table completion problems. The numerical experiments 
with the standard and challenging movie ratings data show 
that the proposed approach, based on probably one of the 
simplest probabilistic models, leads to the results in the same 
ballpark as the state-of-the-art, at a lower computational cost. 

Index Terms — Matrix completion, inverse problems, 
collaborative filtering, Gaussian mixture models, MAP esti- 
mation, EM algorithm 

1. INTRODUCTION 

Matrix completion amounts to estimate all the entries in a ma- 
trix F G jj MxAr from a partial and potentially noisy observa- 
tion 

Y = H»F + W, (1) 

where H G M. M * N j s a binary matrix with 1/0 entries masking 
a portion of F through the element-wise multiplication •, and 
W G jg^xN an additive noise. With matrix rows, columns, 
and entry values assigned to various attributes, matrix com- 
pletion may have numerous applications. For example, when 
the rows and the columns of F are attributed to users and items 
such as movies and books, and an entry at position (m,n) 
records a score given by the ra-fh user to the «-th item, matrix 
completion predicts users' scores on the items they have not 
yet rated, based on the available scores recorded in Y, so that 
personalized item recommendation system becomes possible. 
This is a classical problem in collaborative filtering. 

To solve an ill-posed matrix completion problem, one 
must rely on some prior information, or in other words, some 
data model. The most popular family of approaches in the 
literature assumes that the matrix F follows approximately a 
low rank model, and calculates the matrix completion with 
a matrix factorization ll4l 1141 . Theoretical results regarding 
the completion of low-rank matrices have been recently ob- 
tained as well, e.g., |f3] \l3l and references therein. More 
elaborative probabilistic models and some refinement have 
been further studied on top of matrix factorization, leading to 
state-of-the-art results f7ll8l[l7l . 

In image processing, assuming that local image patches 
follow Gaussian mixture models (GMM), Yu, Sapiro and 



Mallat have recently reported excellent results in a number of 
inverse problems [ 16 1. In particular, for inpainting, which is 
an analogue to matrix completion where the data is an image, 
the maximum a posteriori expectation-maximization (MAP- 
EM) algorithm for solving the GMM leads to state-of-the-art 
results, with a computational complexity considerably re- 
duced with respect to the previous state-of-the-art approaches 
based on sparse models [9|, (which are analogous to the 
low-rank model assumption for matrices). 

In this paper, we investigate Gaussian modeling (a par- 
ticular case of GMM with only a single Gaussian distribu- 
tion) for matrix completion. Subparts of the matrix, typically 
rows or columns, are regarded as a collection of signals that 
are assumed to follow a Gaussian distribution. An efficient 
MAP-EM algorithm is introduced to estimate both the Gaus- 
sian parameters (mean and covariance) and the signals. We 
show through numerical experiments that the fast MAP-EM 
algorithm, based on the Gaussian model which is the simplest 
probabilistic model one can imagine, leads to results in the 
same ballpark as the state-of-the-art in movie rating predic- 
tion, at significantly lower computational cost. Recent theo- 
retical results lfl5l further support the consideration of Gaus- 
sian models for the recovery of missing data. 

Section [2] introduces the Gaussian model and the MAP- 
EM algorithm. After presenting the numerical experiments in 
Section^ Section|4]concludes with some discussions. 



2. MODEL AND ALGORITHM 
2.1. Gaussian Model 

Similar to the local patch decomposition often applied in im- 
age processing ifTBI . let us consider each subpart of the matrix 
F G R MxN , the i'-th row f, G R N for example, as a signal. Let 
h, G M. N denote the 2-th row in the binary matrix H, and let 
Nj= | { j|hj (j) 7^ 0, 1 < j < N} | be the number of non-zero en- 
tries in h,-. Let U, G M. N ' xN denote a masking operator which 
maps from M." to M."' extracting entries of f,- corresponding to 
the non-zero entries of h,-, i.e., all but the (Idx(h,,fc))-th the 
entries in the k-th row of U, are zero, with (Idx(h,,fc)) the in- 
dex of the k-th non-zero entry in h,, 1 <k< Nj. Let y,- G R^' 
and Wj G R ' be respectively the sub-vector of the 2-th row 
of Y and W, where the entries of h, are non-zero. With this 



notation, we can rewrite (Q]) in a more general linear model 

y,-=U,-f,- + w„ (2) 

for all the signals f,, 1 < i < M. Q Note that f; can also be 
columns, or 2D sub-matrices of F rendered in ID. 

The Gaussian model assumes that each signal f, is drawn 
from a Gaussian distribution, with a probability density func- 
tion 

p ® = (2^172 ex P ("^ - ^ 1 » " M)) • ( 3 ) 

where E and are the unknown covariance and mean param- 
eters of the Gaussian distribution. The noise w< is assumed to 
follow a Gaussian distribution with zero mean and covariance 
E w , here assumed to be known (or calibrated). 

Estimating the matrix F from the partial observation Y 
can thus be casted into the following problem: 

• Estimate the Gaussian parameters from the ob- 
servation {y,}i<KM- 

• Estimate f, from y,, 1 <i<M, using the Gaussian dis- 
tribution .jV(\iX)- 

Since this is a non-convex problem, we present an effi- 
cient maximum a posteriori expectation-maximization (MAP- 
EM) algorithm that calculates a local-minimum solution [ 1 

2.2. Algorithm 

Following a simple initialization, addressed in Section 12.2.31 
the MAP-EM algorithm is an iterative procedure that alter- 
nates between two steps: 

1. E-step: signal estimation. Assuming the estimates 
(A , E) are known (following the previous M-step), for 
each i one computes the maximum a posteriori (MAP) 
estimate f, of f,. 

2. M-step: model estimation. Assuming the signal esti- 
mates f„ Vz, are known (following the previous E-step), 
one estimates (updates) the Gaussian model parameters 

2.2.1. E-step: Signal Estimation 

It is well known that under the Gaussian models assumed 
in Section 12.11 the MAP estimate that maximizes the log a- 
posteriori probability 

?; = argiraxlogp(fi|yi,£,£) 

= argmax (logp(y,-[f,-) + logp(f ; |A,£)) 

= argnun ((U/f/ - y/) r Z w 1 (U,-f/ - y,) + (f; - A) 7 ! -1 (f; - A)) 

= A+£uf(u,£uf+s w )- 1 (y/-U/A), (4) 

1 Writing y, in the reduced dimension Ni leads to a calculation in dimen- 
sion Nj instead of N, which is considerably faster if Nj -C N. 



is a linear estimator and is optimal in the sense that it mini- 
mizes the mean square error (MSE), i.e., Ef. w . [||f,- — ? 1 1 1 1 ] = 
minjE'f. w . [||f/ — ^(y,-)!!!], as well as the mean absolute er- 
ror (MAE), i.e., ^.[Hfi-fiHi] = min^Ef. W| [||f, -g(y t )\\i], 
where g is any mapping from M. N — > R"' J6j. The second 
equality of (H} follows from the Bayes rule, the third follows 
from the Gaussian models f, ~ jY (jU,E) and w,- ~ o/K(0,E w ), 
and the last is obtained by deriving the third line with respect 
to f,. 

The close-form MAP estimate © can be calculated fast. 
Observe that U,- G K ' xN is a sparse extraction matrix, each 
row containing only one non-zero entry with value 1, whose 
index corresponds to the non-zero entry in h,. Therefore, 
the multiplication operations that involve U, or Uf can be 
realized by extracting the appropriate rows or columns at 
zero computational cost. The complexity of © is there- 
fore dominated by the matrix inversion. As U,EUf + E w is 
positive-definite, (U,EUf + E W )~' can be implemented with 
Nf /3 flops through a Cholesky factorization 0. 

In a typical case where f, is the z'-th row of the matrix 
F, 1 < i < M, to estimate F the total complexity of the E- 
step is therefore dominated by Y!u=\Nf /2> flops. For typical 
rating prediction datasets that are highly sparse, among a large 
number of items N, most users have rated more or less only a 
small number Af, « N/C of items, where C is large. The total 
complexity of the E-step is thus dominated by -^MN 3 flops. 

2.2.2. M-step: Model Estimation 

The parameters of the model are estimated/updated with the 
maximum likelihood estimate, 

(/i,£) = argmaxlogp({f,} 1 <K M |^i,E). (5) 

With the Gaussian model f, ~ ,yV(jj.,T.), it is well known that 
the resulting estimate is the empirical one, 

i M i M 

A = Jj £ f i and t,= — £ (f,- - A) (f, - A) r ■ (6) 

The empirical covariance estimate may be improved 
through regularization when there is lack of data (let us 
take an example of standard rating prediction, where there 
are N ~ 10 3 items and M ~ 10 4 users, the dimension of the 
covariance matrix T.^ is N x N ^ 10 6 ). A simple and standard 
eigenvalue-based regularization is used here, 

£<-£ + eld, (7) 

where e is a small constant. The regularization also guaran- 
tees that the estimate E of the covariance matrix is full-rank, 
so that (0]i is always well defined. 

To estimate F, the computational complexity of the M- 
step is dominated by the calculation of the empirical covari- 
ance estimate requiring MN 2 flops, which is negligible with 
respect to the E-step. 



As the MAP-EM algorithm iterates, the MAP probability 
of the observed signals p(f,|y,,/l,£) increases. This can be 
observed by interpreting the E- and M-steps as a coordinate 
descent optimization J5). In the experiments, the algorithm 
converges within 10 iterations. 

2.2.3. Initialization 

The MAP-EM algorithm is initialized with an initial guess of 
f,-, Mi. The experiments show that the result is insensitive to 
the initialization for movie rating prediction. In the numerical 
experiments, all the unknown entries are initialized to 3 for 
datasets containing ratings ranging from 1 to 5 or 6. 

2.2.4. Computational and Memory Requirements 

In a typical case where the matrix row decomposition in (|2]i 
is considered, the overall computational complexity of the 
MAP-EM to estimate an M x N matrix is dominated by 
^LMN 3 , with L the number of iterations (typically < 10) 
and 1 /C the available data ratio, with C typically large (m 25 
for the standard movie ratings data). The algorithm is thus 
very fast. As each row f, is treated as a signal and the signals 
can be estimated in sequence, the memory requirement is 
dominated by N 2 (to store the covariance matrix E). 

3. NUMERICAL EXPERIMENTS 
3.1. Experimental Protocols 

The experimental protocols strictly follow those described in 
the literature HE [TO] [12 El E). The proposed method is 
evaluated on two movie rating prediction benchmark datasets, 
the EachMovie dataset and the 1M MovieLens dataset. □ 
Two test protocols, the so-called "weak generalization" and 
"strong generalization," lITOl are applied. 

• Weak generalization measures the ability of a method 
to generalize to other items rated by the same users used 
for training the method. One known rating is randomly 
held out from each user's rating set to form a test set, 
the rest known ratings form a training set. The method 
is trained using the data in the training set, and its per- 
formance is evaluated over the test set. 

• Strong generalization measures the ability of the 
method to generalize to some items rated by novel 
users that have not been used for training. The set of 
users is first divided into training users and test users. 
Learning is performed with all available ratings from 
the training set. To test the method, the ratings of each 
test users are further split into an observed set and a 
held-out set. The method is shown the observed ratings, 
and is evaluated by predicting the held-out ratings. 

2 http://www. grouplens.org/ 



The EachMovie dataset contains 2.8 million ratings in the 
range { 1 , . . . , 6} for 1 ,648 movies (columns) and 74,424 users 
(rows). Following the standard procedure [10], users with 
fewer than 20 ratings and movies with less than 2 ratings are 
removed. This leaves us 36,656 users, 1,621 movies, and 2.5 
million ratings (available data ratio 4.3%). We randomly se- 
lect 30,000 users for the weak generalization, and 5,000 users 
for the strong generalization. The 1M MovieLens dataset con- 
tains 1 million ratings in the range { 1 , . . . , 5} for 3,900 movies 
(columns) and 6,040 users (rows). The same filtering leaves 
us 6,040 users, 3,592 movies, and 1 million ratings (available 
data ratio 4.6%). We randomly select 5,000 users for the weak 
generalization, and 1,000 users for the strong generalization. 
Each experiment is run 3 times and the average reported. 

The performance of the method is measured by the stan- 
dard Normalized Mean Absolute Error (NMAE), computed 
by normalizing the mean absolute error by a factor for which 
random guessing produces a score of 1 . The factor is 1 .944 
for EachMovie, and 1.6 for MovieLens. 

3.2. Proposed Method Setup 

In contrast to most exiting algorithms in the literature, the pro- 
posed method, thanks to its simplicity, enjoys the advantage 
of having very few intuitive parameters. The covariance regu- 
larization parameter £ in (0 is set equal to 0.3 (whose square 
root is one order of magnitude smaller than the maximum rat- 
ing), the results being insensitive to this value as shown by 
the experiments. The number of iterations of the MAP-EM 
algorithm is fixed at L = 10, beyond which the convergence 
of the algorithm is always observed. The noise w, in (O is 
neglected, i.e., E w is set to 0, as the movie rating datasets 
mainly involve missing data, the noise being implicit and as- 
sumed negligible. 

The experiments show that considering row f, £ M. N of 
the matrix F S M. MxN as signals leads to slightly better re- 
sults than taking columns or 2D sub-matrices. This means 
that each user is a signal, whose dimension is the number of 
movies. 

As in previous works [ 17|, a post-processing that projects 
the estimated rating to an integer within the rating range is 
performed. 

A Matlab code of the proposed algorithm is available at 
http://www.cmap.polytechnique.fr/~yu/research/MC/demo.html 

3.3. Results 

The results of the proposed method are compared with the 
best published ones including User Rating Profile (URP) ifTOll , 
Attitude lUJJ, Maximum Margin Matrix Factorization (MMMF) |fT4l . 
Ensemble of MMMF (E-MMMF) [4|, Item Proximity based 
Collaborative Filtering (IPCF) [12|, Gaussian Process Latent 
Variable Models (GPLVM) Q, Mixed Membership Matrix 
Factorization (M 3 F) (|J, and Nonparametric Bayesian Matrix 
Completion (NBMC) lUTl . For each of these methods, more 
than one results produced with different configurations are 



often reported, among which we systematically cite the best 
one. All these methods are significantly more complex than 
the one here proposed. 

Tables Q] and [2] presents the results of various methods 
for both weak and strong generalizations on the two datasets. 
NBMC generates most often the best results, followed closely 
by the proposed method referred to as GM (Gaussian Model) 
and GPLVM, all of them outperforming the other methods. 
The results produced by the proposed GM, with a by far sim- 
pler model and faster algorithm, is in the same ballpark as 
those of NBMC and GPLVM, the difference with NMAE be- 
ing smaller than about 0.005, marginal in the rating range that 
goes from 1 to 5 or 6. 



EachMovie 


Weak NMAE 


Strong NMAE 


URP [10| 


0.4422 


0.4557 


Attitude |11] 


0.4520 


0.4550 


MMMF 1 14] 


0.4397 


0.4341 


IPCF |12| 


0.4382 


0.4365 


E-MMMF |4| 


0.4287 


0.4301 


GPLVM |7| 


0.4179 


0.4134 


M j F |8 1 


0.4293 


n/a 


NBMC 1 17 1 


0.4109 


0.4091 


GM (proposed) 


0.4164 


0.4163 



Table 1. NMAEs generated by different methods for Each- 
Movie database. The smallest NMAE is in boldface. 



1M MovieLens 


Weak NMAE 


Strong NMAE 


URP [10| 


0.4341 


0.4444 


Attitude |11] 


0.4320 


0.4375 


MMMF 1 14] 


0.4156 


0.4203 


IPCF 1 12 1 


0.4096 


0.4113 


E-MMMF |4| 


0.4029 


0.4071 


GPLVM |7| 


0.4026 


0.3994 


M J F|8| 


n/a 


n/a 


NBMC |17 1 


0.3916 


0.3992 


GM (proposed) 


0.3959 


0.3928 



Table 2. NMAEs generated by different methods for 1M 
MovieLens database. The smallest NMAE is in boldface. 

4. CONCLUSION AND DISCUSSION 

We have shown that a Gaussian model and a MAP-EM algo- 
rithm provide a simple and computational efficient solution 
for matrix completion, leading to results in the same ballpark 
as state-of-the-art ones for movie rating prediction. 

Future work may go in several directions. On the one 
hand, the proposed conceptually simple and computationally 
efficient method may provide a good baseline for further re- 
finement, for example by incorporating user and item bias or 
meta information flTl . On the other hand, Gaussian mix- 
ture models (GMM) that have been shown to bring dramatic 



improvements over single Gaussian models in image inpaint- 
ing JT6 |, are expected to better capture different characteris- 
tics of various categories of movies (comedy, action, etc.) and 
classes of users (age, gender, etc.). However, no significant 
improvement by GMM over Gaussian model has yet been ob- 
served for movie rating prediction. This needs to be further 
investigated, and such improvement might come from proper 
grouping and initialization. 
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